from hashlib import sha1
import altair as alt
import pandas as pd
import numpy as np
alt.data_transformers.disable_max_rows()

Use read_csv from pandas to load the data from the data folder and assign it to a variable named gapminder_df.

Make sure to parse any time columns using the parse_dates argument.

gapminder_df = pd.read_csv("C:/Users/sindi/Downloads/world-data-gapminder.csv", parse_dates=['year'])
gapminder_df.head()
| | country | year | population | region | sub_region | income_group | life_expectancy | income | children_per_woman | child_mortality | pop_density | co2_per_capita | years_in_school_men | years_in_school_women |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1800-01-01 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.0 | 469.0 | NaN | NaN | NaN | NaN |
| 1 | Afghanistan | 1801-01-01 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.0 | 469.0 | NaN | NaN | NaN | NaN |
| 2 | Afghanistan | 1802-01-01 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.0 | 469.0 | NaN | NaN | NaN | NaN |
| 3 | Afghanistan | 1803-01-01 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.0 | 469.0 | NaN | NaN | NaN | NaN |
| 4 | Afghanistan | 1804-01-01 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.0 | 469.0 | NaN | NaN | NaN | NaN |
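
To see what parse_dates does in isolation, here is a minimal sketch that reads a tiny in-memory CSV instead of the Gapminder file. The rows and the names csv_text and demo_df are made up for illustration; only read_csv and its parse_dates argument come from the cell above.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real data file.
csv_text = "country,year,population\nA,1800-01-01,100\nA,1801-01-01,110\n"

# parse_dates converts the listed columns to datetime64 while loading.
demo_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["year"])

# The year column is now a datetime column, so the .dt accessor works on it.
print(demo_df.dtypes)
print(demo_df["year"].dt.year.tolist())
```

Without parse_dates, the year column in this sketch would load as plain strings and the .dt accessor would raise an error.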
plot_a = alt.Chart(gapminder_df).mark_line().encode(
    x='year', 
    y='population', 
    color='region').properties(
    title='Plot A',width=300, height=200)

plot_b = alt.Chart(gapminder_df).mark_line().encode(
    x='year', 
    y='sum(population)', 
    color='region').properties(
    title='Plot B',width=300, height=200)

plot_c = alt.Chart(gapminder_df).mark_area().encode(
    x='year', 
    y='sum(population)', 
    color='region').properties(
    title='Plot C',width=300, height=200)

plot_d = alt.Chart(gapminder_df).mark_circle().encode(
    alt.X('year', scale=alt.Scale(zero=False)),
    alt.Y('population'),
    alt.Color('region')).properties(
    title='Plot D',width=300, height=200)

alt.vconcat(alt.hconcat(
    plot_a, plot_b
).resolve_scale(
    color='independent'),  
alt.hconcat(
    plot_c, plot_d
).resolve_scale(
    color='independent'))
plot_a
plot_b
plot_d
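
Plots A and D encode the raw population column, while Plots B and C encode sum(population). In Altair's shorthand, that aggregate is computed over the rows that share the other encoded fields, here year and region. A rough pandas equivalent on made-up numbers is sketched below; the toy frame and the name region_totals are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical data: two Asian countries and one African country over two years.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "region": ["Asia", "Asia", "Africa", "Asia", "Asia", "Africa"],
    "population": [10, 20, 5, 11, 21, 6],
})

# One total per (year, region) pair, which is what a summed line per region plots.
region_totals = toy.groupby(["year", "region"], as_index=False)["population"].sum()
print(region_totals)
```

With the raw column, every country row is passed to the mark separately, so a single line per region has to connect several values per year.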

As we can see, the dataframe is difficult to digest

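# Compare calendar years as integers; matching a datetime column against strings depends on the pandas version.
gapminder_every20 = gapminder_df[gapminder_df["year"].dt.year.isin([1918, 1938, 1958, 1978, 1998, 2018])]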
gapminder_every20
| | country | year | population | region | sub_region | income_group | life_expectancy | income | children_per_woman | child_mortality | pop_density | co2_per_capita | years_in_school_men | years_in_school_women |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 118 | Afghanistan | 1918-01-01 | 5700000 | Asia | Southern Asia | Low | 7.89 | 849 | 7.00 | 468.0 | NaN | NaN | NaN | NaN |
| 138 | Afghanistan | 1938-01-01 | 6900000 | Asia | Southern Asia | Low | 31.30 | 963 | 7.33 | 443.0 | NaN | NaN | NaN | NaN |
| 158 | Afghanistan | 1958-01-01 | 8680000 | Asia | Southern Asia | Low | 37.20 | 1180 | 7.48 | 375.0 | 13.30 | 0.0380 | NaN | NaN |
| 178 | Afghanistan | 1978-01-01 | 13200000 | Asia | Southern Asia | Low | 45.00 | 1190 | 7.45 | 258.0 | 20.30 | 0.1630 | 1.67 | 0.27 |
| 198 | Afghanistan | 1998-01-01 | 18900000 | Asia | Southern Asia | Low | 50.10 | 956 | 7.62 | 137.0 | 28.90 | 0.0552 | 2.76 | 0.55 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38901 | Zimbabwe | 1938-01-01 | 2500000 | Africa | Sub-Saharan Africa | Low | 34.80 | 1180 | 6.75 | 345.0 | NaN | 1.1100 | NaN | NaN |
| 38921 | Zimbabwe | 1958-01-01 | 3520000 | Africa | Sub-Saharan Africa | Low | 53.00 | 1950 | 7.04 | 159.0 | 9.09 | NaN | NaN | NaN |
| 38941 | Zimbabwe | 1978-01-01 | 6700000 | Africa | Sub-Saharan Africa | Low | 56.70 | 2250 | 7.27 | 108.0 | 17.30 | 1.3900 | 6.20 | 4.55 |
| 38961 | Zimbabwe | 1998-01-01 | 11900000 | Africa | Sub-Saharan Africa | Low | 49.10 | 2750 | 4.16 | 95.9 | 30.70 | 1.2000 | 8.80 | 7.39 |
| 38981 | Zimbabwe | 2018-01-01 | 16900000 | Africa | Sub-Saharan Africa | Low | 60.20 | 1950 | 3.61 | 55.5 | 43.70 | NaN | NaN | NaN |

1068 rows × 14 columns
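
The filter above keeps one row per country for every twentieth year from 1918 to 2018. As a small sketch of the same idea on made-up data, the block below builds a frame with one row per year and selects the same six years through the .dt.year accessor. The names toy_df, wanted_years and toy_every20 are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one datetime row per year, 1900 through 2018.
toy_df = pd.DataFrame(
    {"year": pd.to_datetime([f"{y}-01-01" for y in range(1900, 2019)])}
)

# Every twentieth year ending at 2018: 1918, 1938, 1958, 1978, 1998, 2018.
wanted_years = list(range(1918, 2019, 20))

# Compare the integer calendar year of each timestamp against the list.
toy_every20 = toy_df[toy_df["year"].dt.year.isin(wanted_years)]
print(toy_every20["year"].dt.year.tolist())
```

Generating the list with range avoids typing the years by hand when the step or the end year changes.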

The combined plot is rather cluttered

family_plot = alt.Chart(gapminder_every20).mark_circle().encode(
    alt.X("children_per_woman", title="Children per woman"),
    alt.Y("child_mortality", title="Child mortality"),
    alt.Color("income_group", title="Income Group")
).properties(title="Child mortality vs Children born per woman")

family_plot

It is much easier to make sense of when faceted

family_plot_faceted = family_plot.facet("year", columns=3)
family_plot_faceted

When the plots are faceted

Some of the equations used in the analysis

Here is the sample mean: \begin{equation} \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i \end{equation}

Here is the sample variance: \begin{equation} \sigma^2 = \frac{\sum\limits_{i=1}^{n}(y_i - \bar{y})^2}{n - 1} \end{equation}
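
As a quick check of the two formulas, here is a minimal sketch that computes them term by term on four made-up values and compares the results with the standard library. The list y and the names y_bar and sample_var are illustrative; they are not taken from the dataset.

```python
from statistics import mean, variance

# Hypothetical observations y_1, ..., y_n.
y = [2.0, 4.0, 4.0, 6.0]
n = len(y)

# Sample mean: the sum of the observations divided by n.
y_bar = sum(y) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
sample_var = sum((y_i - y_bar) ** 2 for y_i in y) / (n - 1)

print(y_bar, sample_var)
print(mean(y), variance(y))  # statistics.variance also divides by n - 1
```

The n - 1 denominator matters: dividing by n instead gives the population variance, which statistics.pvariance computes.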